Abstract

Convolutional neural nets (CNNs) have demonstrated remarkable performance in recent history. Such approaches tend to work in a “unidirectional” bottom-up feed-forward fashion. However, practical experience and biological evidence tell us that feedback plays a crucial role, particularly for detailed spatial understanding tasks. This work explores “bidirectional” architectures that also reason with top-down feedback: neural units are influenced by both lower and higher-level units.

We do so by treating units as rectified latent variables in a quadratic energy function, which can be seen as a hierarchical Rectified Gaussian (RG) model [39]. We show that RGs can be optimized with a quadratic program (QP), which can in turn be optimized with a recurrent neural network (with rectified linear units). This allows RGs to be trained with GPU-optimized gradient descent. From a theoretical perspective, RGs help establish a connection between CNNs and hierarchical probabilistic models. From a practical perspective, RGs are well suited for detailed spatial tasks that can benefit from top-down reasoning. We illustrate them on the challenging task of keypoint localization under occlusions, where local bottom-up evidence may be misleading. We demonstrate state-of-the-art results on challenging benchmarks.


1. Introduction 


Hierarchical models of visual processing date back to the iconic work of Marr [31]. Convolutional neural nets (CNNs), pioneered by LeCun et al. [27], are hierarchical models that compute progressively more invariant representations of an image in a bottom-up, feedforward fashion. They have demonstrated remarkable progress in recent history for visual tasks such as classification [25, 38, 43], object detection, and image captioning [22], among others.

Feedback in biology: Biological evidence suggests that vision at a glance tasks, such as rapid scene categorization [48], can be effectively computed with feedforward hierarchical processing. However, vision with scrutiny tasks, such as fine-grained categorization [23] or detailed spatial manipulations [19], appear to require feedback along a “reverse hierarchy” [17]. Indeed, most neural connections in the visual cortex are believed to be feedback rather than feedforward [4, 26].

Figure 1: On the top, we show a state-of-the-art multi-scale feedforward net, trained for keypoint heatmap prediction, where the blue keypoint (the right shoulder) is visualized in the blue plane of the RGB heatmap. The ankle keypoint (red) is confused between left and right legs, and the knee (green) is poorly localized along the leg. We believe this confusion arises from bottom-up computations of neural activations in a feedforward network. On the bottom, we introduce hierarchical Rectified Gaussian (RG) models that incorporate top-down feedback by treating neural units as latent variables in a quadratic energy function. Inference on RGs can be unrolled into recurrent nets with rectified activations. Such architectures produce better features for “vision-with-scrutiny” tasks (such as keypoint prediction) because lower layers receive top-down feedback from above. Leg keypoints are much better localized with top-down knowledge (that may capture global constraints such as kinematic consistency).

Feedback in computer vision: Feedback has also played a central role in many classic computer vision models. Hierarchical probabilistic models [20, 28, 55] allow random variables in one layer to be naturally influenced by those above and below. For example, lower layer variables may encode edges, middle layer variables may encode parts, while higher layers encode objects. Part models [5] allow a face object to influence the activation of an eye part through top-down feedback, which is particularly vital for occluded parts that receive misleading bottom-up signals. Interestingly, feed-forward inference on part models can be written as a CNN [9], but the proposed mapping does not hold for feedback inference.

Overview: To endow CNNs with feedback, we treat neural units as nonnegative latent variables in a quadratic energy function. When probabilistically normalized, our quadratic energy function corresponds to a Rectified Gaussian (RG) distribution, for which inference can be cast as a quadratic program (QP) [39]. We demonstrate that coordinate descent optimization steps of the QP can be “unrolled” into a recurrent neural net with rectified linear units. This observation allows us to discriminatively tune RGs with neural network toolboxes: we tune Gaussian parameters such that, when latent variables are inferred from an image, the variables act as good features for discriminative tasks. From a theoretical perspective, RGs help establish a connection between CNNs and hierarchical probabilistic models. From a practical perspective, we introduce RG variants of state-of-the-art deep models (such as VGG16 [38]) that require no additional parameters, but consistently improve performance due to the integration of top-down knowledge.

2. Hierarchical Rectified Gaussians 

In this section, we describe the Rectified Gaussian models of Socci and Seung [39] and their relationship with rectified neural nets. Because we will focus on convolutional nets, it will help to think of variables $z = [z_i]$ as organized into layers, spatial locations, and channels (much like the neural activations of a CNN). We begin by defining a quadratic energy over variables $z$:

$$S(z) = \frac{1}{2} z^T W z + b^T z \tag{1}$$

$$P(z) \propto e^{S(z)}$$

Boltzmann: $z_i \in \{0, 1\}$, $w_{ii} = 0$
Gaussian: $z_i \in \mathbb{R}$, $-W$ is PSD
Rect. Gaussian: $z_i \in \mathbb{R}^+$, $-W$ is copositive

where $W = [w_{ij}]$, $b = [b_i]$. The symmetric matrix $W$ captures bidirectional interactions between low-level features (e.g., edges) and high-level features (e.g., objects). Probabilistic models such as Boltzmann machines, Gaussians, and Rectified Gaussians differ simply in restrictions on the latent variable - binary, continuous, or nonnegative. Hierarchical models, such as deep Boltzmann machines [36], can be written as a special case of a block-sparse matrix $W$ that ensures that only neighboring layers have direct interactions.

Figure 2: A hierarchical Rectified Gaussian model where latent variables are denoted by circles, and arranged into layers and spatial locations. We write $x$ for the input image and $w_i$ for convolutional weights connecting layer $i - 1$ to $i$. Lateral inhibitory connections between latent variables are drawn in red. Layer-wise coordinate updates are computed by filtering, rectification, and non-maximal suppression.

Normalization: To ensure that the scoring function can be probabilistically normalized, Gaussian models require that $(-W)$ be positive semidefinite (PSD) ($-z^T W z \ge 0, \forall z$). Socci and Seung [39] show that Rectified Gaussians require the matrix $(-W)$ to only be copositive ($-z^T W z \ge 0, \forall z \ge 0$), which is a strictly weaker condition. Intuitively, copositivity ensures that the maximum of $S(z)$ is still finite, allowing one to compute the partition function. This relaxation significantly increases the expressive power of a Rectified Gaussian, allowing for multimodal distributions. We refer the reader to the excellent discussion in [39] for further details.
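
To make the gap between the two conditions concrete, the following sketch checks a small matrix that is copositive but not PSD. The 2x2 matrix is a hypothetical example chosen for illustration (it does not come from the paper), and the copositivity check is only a sampling-based sanity check, not a proof.

```python
import numpy as np

# Hypothetical example: A plays the role of (-W).
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

# PSD requires every eigenvalue of the symmetric matrix to be nonnegative.
eigenvalues = np.linalg.eigvalsh(A)
is_psd = bool(np.all(eigenvalues >= 0))

# Copositivity only constrains z^T A z on the nonnegative orthant.
# Sampling nonnegative vectors is a sanity check rather than a certificate.
rng = np.random.default_rng(0)
Z = rng.uniform(0.0, 1.0, size=(10000, 2))
quad_forms = np.einsum("ni,ij,nj->n", Z, A, Z)
looks_copositive = bool(np.all(quad_forms >= 0))

# A vector with mixed signs exposes the negative eigenvalue.
z_mixed = np.array([1.0, -1.0])
mixed_value = float(z_mixed @ A @ z_mixed)
```

Here the quadratic form is $z_1^2 + 4 z_1 z_2 + z_2^2$, which cannot be negative when both coordinates are nonnegative, yet it is negative along the direction $(1, -1)$.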

Comparison: Given observations (the image) in the lowest layer, we will infer the latent states (the features) from the above layers. Gaussian models are limited in that features will always be linear functions of the image. Boltzmann machines produce nonlinear features, but may be limited in that they pass only binary information across layers [33]. Rectified Gaussians are nonlinear, but pass continuous information across layers: $z_i$ encodes the presence or absence of a feature, and if present, the strength of this activation (possibly emulating the firing rate of a neuron).

Inference: Socci and Seung point out that MAP estimation of Rectified Gaussians can be formulated as a quadratic program (QP) with nonnegativity constraints [39]:

$$\max_{z \ge 0} \; \frac{1}{2} z^T W z + b^T z \tag{2}$$

However, rather than using projected gradient descent (as proposed by [39]), we show that coordinate descent is particularly effective in exploiting the sparsity of $W$. Specifically, let us optimize a single $z_i$ holding all others fixed. Maximizing a 1-d quadratic function subject to nonnegativity constraints is easily done by solving for the optimum and clipping:

$$\max_{z_i \ge 0} f(z_i) \quad \text{where} \quad f(z_i) = \frac{1}{2} w_{ii} z_i^2 + \Big(b_i + \sum_{j \ne i} w_{ij} z_j\Big) z_i$$

$$\frac{\partial f}{\partial z_i} = w_{ii} z_i + b_i + \sum_{j \ne i} w_{ij} z_j = 0$$

$$z_i = -\frac{1}{w_{ii}} \max\Big(0,\; b_i + \sum_{j \ne i} w_{ij} z_j\Big) \tag{3}$$

$$\phantom{z_i} = \max\Big(0,\; b_i + \sum_{j \ne i} w_{ij} z_j\Big) \quad \text{for } w_{ii} = -1$$

By fixing $w_{ii} = -1$ (which we do for all our experiments), the above maximization can be solved with a rectified dot-product operation.
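
The coordinate update in (3) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the dense matrix representation, the fixed number of sweeps, and the 2x2 example are all hypothetical, and the diagonal of $W$ is assumed to be $-1$.

```python
import numpy as np

def coordinate_ascent(W, b, num_sweeps=50):
    """Maximize 0.5 z^T W z + b^T z over z >= 0, assuming diag(W) == -1."""
    n = len(b)
    z = np.zeros(n)
    for _ in range(num_sweeps):
        for i in range(n):
            # Sum over j != i of w_ij * z_j (the diagonal term is removed).
            coupling = W[i] @ z - W[i, i] * z[i]
            # With w_ii = -1 the clipped optimum is a rectified dot product.
            z[i] = max(0.0, b[i] + coupling)
    return z

def energy(W, b, z):
    """Quadratic score S(z) from equation (1)."""
    return float(0.5 * z @ W @ z + b @ z)

# Hypothetical two-variable example with a symmetric W.
W = np.array([[-1.0, 0.5],
              [0.5, -1.0]])
b = np.array([1.0, -2.0])
z_star = coordinate_ascent(W, b)
best_energy = energy(W, b, z_star)
```

In this example the second variable is pushed below zero by its bias and is clipped to 0, while the first settles at its unconstrained optimum.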

Layerwise-updates: The above updates can be performed for all latent variables in a layer in parallel. With a slight abuse of notation, let us define the input image to be the (observed) bottom-most layer $x = z_0$, and write the variable at layer $i$ and spatial position $u$ as $z_i[u]$. The weight connecting $z_{i-1}[v]$ to $z_i[u]$ is given by $w_i[\tau]$, where $\tau = v - u$ depends only on the relative offset between $u$ and $v$ (visualized in Fig. 2):

$$z_i[u] = \max(0,\; b_i + \mathrm{top}_i[u] + \mathrm{bot}_i[u]) \quad \text{where} \tag{4}$$

$$\mathrm{top}_i[u] = \sum_{\tau} w_{i+1}[\tau] \, z_{i+1}[u - \tau]$$

$$\mathrm{bot}_i[u] = \sum_{\tau} w_i[\tau] \, z_{i-1}[u + \tau]$$

where we assume that layers have a single one-dimensional channel of a fixed length to simplify notation. By tying together weights such that they only depend on relative locations, bottom-up signals can be computed with cross-correlational filtering, while top-down signals can be computed with convolution. In the existing literature, these are sometimes referred to as deconvolutional and convolutional filters (related through a 180° rotation) [53]. It is natural to start coordinate updates from the bottom layer $z_1$, initializing all variables to 0. During the initial bottom-up coordinate pass, $\mathrm{top}_i$ will always be 0. This means that the bottom-up coordinate updates can be computed with simple filtering and thresholding. Hence a single bottom-up pass of layer-wise coordinate optimization of a Rectified Gaussian model can be implemented with a CNN.
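
The two signals in (4) map directly onto standard 1-D filtering routines. The sketch below assumes a single channel and "valid" borders, and uses NumPy's `correlate` and `convolve`; the helper names, random inputs, and bias value are hypothetical. It also checks numerically that the bottom-up and top-down operators are transposes of one another, which is what a symmetric $W$ implies.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def bottom_up(z_below, w):
    # bot[u] = sum_tau w[tau] * z_below[u + tau]  (cross-correlation)
    return np.correlate(z_below, w, mode="valid")

def top_down(z_above, w):
    # top[u] = sum_tau w[tau] * z_above[u - tau]  (convolution)
    return np.convolve(z_above, w, mode="full")

rng = np.random.default_rng(0)
x = rng.normal(size=8)    # observed bottom layer z_0
w1 = rng.normal(size=3)   # filter connecting layer 0 to layer 1
b1 = 0.1                  # hypothetical shared bias

# First bottom-up update: higher layers are still zero, so top_1 vanishes
# and the update reduces to filtering followed by rectification.
z1 = relu(b1 + bottom_up(x, w1))

# Adjoint check: <bottom_up(x), y> equals <x, top_down(y)> for any y.
y = rng.normal(size=z1.shape)
lhs = float(bottom_up(x, w1) @ y)
rhs = float(x @ top_down(y, w1))
```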

Top-down feedback: We add top-down feedback simply by applying additional coordinate updates in a top-down fashion, from the top-most layer to the bottom. Fig. 3 shows that such a sequence of bottom-up and top-down updates can be “unrolled” into a feed-forward CNN with “skip” connections between layers and tied weights. One can interpret such a model as a recurrent CNN that is capable of feedback, since lower-layer variables (capturing say, edges) can now be influenced by the activations of high-layer variables (capturing say, objects). Note that we make use of recurrence along the depth of the hierarchy, rather than along time or spatial dimensions as is typically done. When the associated weight matrix $W$ is copositive, an infinitely-deep recurrent CNN must converge to the solution of the QP from (2).

Figure 3: On the left, we visualize two sequences of layer-wise coordinate updates on our latent-variable model. The first is a bottom-up pass, while the second is a bottom-up + top-down pass. On the right, we show that bottom-up updates can be computed with a feed-forward CNN, and bottom-up-and-top-down updates can be computed with an “unrolled” CNN with additional skip connections and tied weights (which we define as a recurrent CNN). We use T to denote a 180° rotation of filters that maps correlation to convolution. We follow the color scheme from Fig. 2.
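
As a rough illustration of the unrolling, the following sketch implements a bottom-up pass and a bottom-up plus top-down pass for a two-layer, single-channel 1-D model. It is a simplified assumption-laden example, not the paper's architecture: the function names `qp1`/`qp2`, scalar biases, random filters, and the choice to stop after re-updating the lowest latent layer are all hypothetical. Because each layer-wise update is an exact block-coordinate maximization (with diagonal weights fixed to $-1$ and no lateral connections), the energy should not decrease after the top-down step.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def bot(z_below, w):
    return np.correlate(z_below, w, mode="valid")

def top(z_above, w):
    return np.convolve(z_above, w, mode="full")

def qp1(x, w1, w2, b1, b2):
    """Single bottom-up pass: an ordinary feed-forward net."""
    z1 = relu(b1 + bot(x, w1))
    z2 = relu(b2 + bot(z1, w2))
    return z1, z2

def qp2(x, w1, w2, b1, b2):
    """Bottom-up pass followed by a top-down update of layer 1."""
    z1, z2 = qp1(x, w1, w2, b1, b2)
    # Skip connection: bot(x, w1) is reused. Tied weights: w2 is reused.
    z1 = relu(b1 + bot(x, w1) + top(z2, w2))
    return z1, z2

def energy(x, z1, z2, w1, w2, b1, b2):
    """S(x, z) for the two-layer model with diagonal weights of -1."""
    return float(-0.5 * z1 @ z1 - 0.5 * z2 @ z2
                 + b1 * z1.sum() + b2 * z2.sum()
                 + z1 @ bot(x, w1) + z2 @ bot(z1, w2))

rng = np.random.default_rng(0)
x, w1, w2 = rng.normal(size=8), rng.normal(size=3), rng.normal(size=3)
b1, b2 = 0.1, -0.2
z1_bu, z2_bu = qp1(x, w1, w2, b1, b2)
z1_td, z2_td = qp2(x, w1, w2, b1, b2)
e_bu = energy(x, z1_bu, z2_bu, w1, w2, b1, b2)
e_td = energy(x, z1_td, z2_td, w1, w2, b1, b2)
```

Note that `qp2` introduces no parameters beyond those of `qp1`; it only reuses them in additional layers.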

Non-maximal suppression (NMS): To encourage sparse activations, we add lateral inhibitory connections between variables from the same groups in a layer. Specifically, we write the weight connecting $z_i[u]$ and $z_i[v]$ for $(u, v) \in$ group as $w_i[u, v] = -\infty$. Such connections are shown as red edges in Fig. 2. For disjoint groups (say, non-overlapping 2x2 windows), layer-wise updates correspond to filtering, rectification, and non-maximal suppression (NMS) within each group.

Unlike max-pooling, NMS encodes the spatial location of the max by returning 0 values for non-maximal locations. Standard max-pooling can be obtained as a special case by replicating filter weights across variables $z_i$ within the same group (as shown in Fig. 2). This makes NMS independent of the top-down signal $\mathrm{top}_i$. However, our approach is more general in that NMS can be guided by top-down feedback: high-level variables (e.g., car detections) influence the spatial location of low-level variables (e.g., wheels), which is particularly helpful when parsing occluded wheels. Interestingly, top-down feedback seems to encode spatial information without requiring additional “capsule” variables.
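
The difference between NMS and max-pooling can be seen on a toy 1-D signal. The sketch below is a simplified illustration, assuming disjoint groups of two and a first-index tie-breaking rule (both hypothetical choices); it treats NMS as a post-processing step on pre-activations rather than deriving it from infinite lateral weights.

```python
import numpy as np

def nms_groups(a, group=2):
    """Rectify, then keep only the maximum of each disjoint group."""
    a = np.maximum(a, 0.0).reshape(-1, group)
    winners = np.argmax(a, axis=1)          # first index wins ties
    rows = np.arange(a.shape[0])
    out = np.zeros_like(a)
    out[rows, winners] = a[rows, winners]   # non-maximal entries stay 0
    return out.reshape(-1)

def max_pool(a, group=2):
    """Rectify, then return one value per group (location is discarded)."""
    return np.maximum(a, 0.0).reshape(-1, group).max(axis=1)

a = np.array([0.2, 1.5, -0.3, -0.1, 2.0, 0.7])
nms_out = nms_groups(a)
pool_out = max_pool(a)

# Summing NMS outputs within each group (i.e., applying the same weight to
# every member of the group) recovers the max-pooled value.
pooled_from_nms = nms_out.reshape(-1, 2).sum(axis=1)
```

The NMS output keeps the original resolution and records where each maximum occurred, which is the information a top-down signal can later shift.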

Approximate inference: Given the above global scoring function and an image $x$, inference corresponds to $\arg\max_z S(x, z)$. As argued above, this can be implemented with an infinitely-deep unrolled recurrent CNN. However, rather than optimizing the latent variables to completion, we perform a fixed number ($k$) of layer-wise coordinate descent updates. This is guaranteed to report back finite variables $z^*$ for any weight matrix $W$ (even when not copositive):

$$z^* = \mathbf{QP}_k(x, W, b), \quad z^* \in \mathbb{R}^N \tag{5}$$

We write $\mathbf{QP}_k$ in bold to emphasize that it is a vector-valued function implementing $k$ passes of layer-wise coordinate descent on the QP from (2), returning a vector of all $N$ latent variables. We set $k = 1$ for a single bottom-up pass (corresponding to a standard feed-forward CNN) and $k = 2$ for an additional top-down pass. We visualize examples of recurrent CNNs that implement $\mathbf{QP}_1$ and $\mathbf{QP}_2$ in Fig. 3.

Output prediction: We will use these $N$ variables as features for $M$ recognition tasks. In our experiments, we consider the task of predicting heatmaps for $M$ keypoints. Because our latent variables serve as a rich, multi-scale description of image features, we assume that simple linear predictors built on them will suffice:

$$y = V^T z^*, \quad y \in \mathbb{R}^M, \; V \in \mathbb{R}^{N \times M} \tag{6}$$

Training: Our overall model is parameterized by $(W, V, b)$. Assume we are given training data pairs of images and output label vectors $(x_i, y_i)$. We define a training objective as follows:

$$\min_{W, V, b} \; R(W) + R(V) + \sum_i \mathrm{loss}\big(y_i, \; V^T \mathbf{QP}_k(x_i, W, b)\big) \tag{7}$$

where $R$ are regularizer functions (we use the Frobenius matrix norm) and “loss” sums the loss of our $M$ prediction tasks (where each is scored with log or softmax loss). We optimize the above by stochastic gradient descent. Because $\mathbf{QP}_k$ is a deterministic function, its gradient with respect to $(W, b)$ can be computed by backprop on the $k$-times unrolled recurrent CNN (Fig. 3). We choose to separate $V$ from $W$ to ensure that feature extraction does not scale with the number of output tasks ($\mathbf{QP}_k$ is independent of $M$). During learning, we fix diagonal weights ($w_i[u, u] = -1$) and lateral inhibition weights ($w_i[u, v] = -\infty$ for $(u, v) \in$ group).

Related work (learning): The use of gradient-based backpropagation to learn an unrolled model dates back to ‘backprop-through-structure’ algorithms [11, 40] and graph transducer networks [27]. More recently, such approaches were explored for general graphical models and Boltzmann machines. Our work uses such ideas to learn CNNs with top-down feedback using an unrolled latent-variable model.


Related work (top-down): Prior work has explored networks that reconstruct images given top-down cues. This is often cast as unsupervised learning with autoencoders [16, 32, 49] or deconvolutional networks [53], though supervised variants also exist [29, 34]. Our network differs in that all nonlinear operations (rectification and max-pooling) are influenced by both bottom-up and top-down knowledge, which is justified from a latent-variable perspective.

3. Implementation 

In this section, we provide details for implementing QP_1 and QP_2 with existing CNN toolboxes. We visualize our specific architecture in Fig. 4, which closely follows the state-of-the-art VGG-16 network [38]. We use 3x3 filters and 2x2 non-overlapping pooling windows (for NMS). Note that, when processing NMS-layers, we conceptually use 6x6 filters with replication after NMS, which in practice can be implemented with standard max-pooling and 3x3 filters (as argued in the previous section). Hence QP_1 is essentially a re-implementation of VGG-16.

QP_2: Fig. 5 illustrates top-down coordinate updates, which require additional feedforward layers, skip connections, and tied weights. Even though QP_2 is twice as deep as QP_1 (and [38]), it requires no additional parameters. Hence top-down reasoning “comes for free”. There is a small notational inconvenience at layers that decrease in size. In typical CNNs, this decrease arises from a previous pooling operation. Our model requires an explicit 2x subsampling step (sometimes known as strided filtering) because it employs NMS instead of max-pooling. When this subsampled layer is later used to produce a top-down signal for a future coordinate update, variables must be zero-interlaced before applying the 180° rotated convolutional filters (as shown by hollow circles in Fig. 5). Note that this is not an approximation, but the mathematically correct application of coordinate descent given subsampled weight connections.
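
The claim that zero-interlacing is exact, not approximate, can be checked numerically in 1-D: the top-down operator built from zero-interlacing plus convolution is the transpose of strided bottom-up filtering. This is a minimal sketch assuming a single channel, stride 2, "valid" borders, and an odd input length chosen so that no extra border padding is needed; the helper names are hypothetical.

```python
import numpy as np

def strided_bottom_up(z_below, w, stride=2):
    # Cross-correlation followed by explicit subsampling (strided filtering).
    return np.correlate(z_below, w, mode="valid")[::stride]

def zero_interlace(z, stride=2):
    # Insert (stride - 1) zeros between consecutive entries of z.
    out = np.zeros(stride * (len(z) - 1) + 1)
    out[::stride] = z
    return out

def strided_top_down(z_above, w, stride=2):
    # Zero-interlace, then apply the 180-degree rotated filter (convolution).
    return np.convolve(zero_interlace(z_above, stride), w, mode="full")

rng = np.random.default_rng(0)
x = rng.normal(size=9)   # lower layer
w = rng.normal(size=3)   # shared filter
y = rng.normal(size=4)   # subsampled upper layer

# Adjoint check: the two operators are transposes of one another.
lhs = float(strided_bottom_up(x, w) @ y)
rhs = float(x @ strided_top_down(y, w))
```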

Supervision y: The target label for a single keypoint is a sparse 2D heat map with a ‘1’ at the keypoint location (or all ‘0’s if that keypoint is not visible on a particular training image). We score this heatmap with a per-pixel log-loss. In practice, we assign ‘1’s to a circular neighborhood that implicitly adds jittered keypoints to the set of positive examples.
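
A target of this kind, and one plausible reading of the per-pixel log-loss, can be sketched as follows. The function names, the radius value, and the interpretation of “log-loss” as a logistic loss on raw scores are assumptions made for illustration; the paper does not specify them here.

```python
import numpy as np

def keypoint_target(height, width, keypoint=None, radius=1.0):
    """Binary heatmap with 1s in a disc around the keypoint (all 0s if invisible)."""
    target = np.zeros((height, width))
    if keypoint is None:
        return target
    rows, cols = np.mgrid[0:height, 0:width]
    r0, c0 = keypoint
    target[(rows - r0) ** 2 + (cols - c0) ** 2 <= radius ** 2] = 1.0
    return target

def per_pixel_log_loss(scores, target):
    """Sum over pixels of log(1 + exp(-s)) for positives, log(1 + exp(s)) for negatives."""
    signs = 2.0 * target - 1.0
    return float(np.sum(np.logaddexp(0.0, -signs * scores)))

visible_target = keypoint_target(5, 5, keypoint=(2, 2), radius=1.0)
hidden_target = keypoint_target(5, 5, keypoint=None)
loss_at_zero = per_pixel_log_loss(np.zeros((5, 5)), visible_target)
```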

Multi-scale classifiers V: We implement our output classifiers as multi-scale convolutional filters defined over different layers of our model. We use upsampling to enable efficient coarse-to-fine computations, as described for fully-convolutional networks (FCNs) [29] (and shown in Fig. 4). Specifically, our multi-scale filters are implemented as 1x1 filters over 4 layers (referred to as fc7, pool4, pool3, and pool2 in [38]). Because our top (fc7) layer is limited in spatial resolution (1x1x4096), we define our coarse-scale filter to be “spatially-varying”, which can alternatively be thought of as a linear “fully-connected” layer that is reshaped to predict a coarse (7x7) heatmap of keypoint predictions given fc7 features. Our intuition is that spatially-coarse global features can still encode global constraints (such as viewpoints) that can produce coarse keypoint predictions. These coarse predictions are upsampled and added to the prediction from pool4, and so on (as in [29]).

Figure 4: We show the architecture of QP_2 implemented in our experiments. QP_1 corresponds to the left half of QP_2, which essentially resembles the state-of-the-art VGG-16 CNN [38]. QP_2 is implemented with a 2x “unrolled” recurrent CNN with transposed weights, skip connections, and zero-interlaced upsampling (as shown in Fig. 5). Importantly, QP_2 does not require any additional parameters. Red layers include lateral inhibitory connections enforced with NMS. Purple layers denote multi-scale convolutional filters that (linearly) predict keypoint heatmaps given activations from different layers. Multi-scale filters are efficiently implemented with coarse-to-fine upsampling [29], visualized in the purple dotted rectangle (to reduce clutter, we visualize only 3 of the 4 multi-scale layers). Dotted layers are not implemented to reduce memory.

Figure 5: Two-pass layer-wise coordinate descent for a two-layer Rectified Gaussian model can be implemented with modified CNN operations. White circles denote 0’s used for interlacing and border padding. We omit rectification operations to reduce clutter. We follow the color scheme from Fig. 2.
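
The coarse-to-fine fusion described above can be sketched as repeated "upsample and add". The example below is a simplification under stated assumptions: it uses nearest-neighbor 2x upsampling (the FCN approach it references typically uses learned or bilinear upsampling), and the function names and heatmap sizes are hypothetical.

```python
import numpy as np

def upsample2x(heatmap):
    # Nearest-neighbor upsampling, used here only for simplicity.
    return np.kron(heatmap, np.ones((2, 2)))

def fuse_coarse_to_fine(predictions):
    """Fuse per-scale heatmaps ordered from coarsest to finest.

    Each prediction is assumed to have twice the resolution of the previous one.
    """
    fused = predictions[0]
    for finer in predictions[1:]:
        fused = upsample2x(fused) + finer
    return fused

# Hypothetical per-scale predictions (e.g., 7x7, 14x14, 28x28).
preds = [np.ones((7, 7)), np.ones((14, 14)), np.ones((28, 28))]
fused = fuse_coarse_to_fine(preds)
```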


Multi-scale training: We initialize parameters of both QP_1 and QP_2 to the pre-trained VGG-16 model [38], and follow the coarse-to-fine training scheme for learning FCNs [29]. Specifically, we first train coarse-scale filters, defined on high-level (fc7) variables. Note that QP_1 and QP_2 are equivalent in this setting. This coarse-scale model is later used to initialize a two-scale predictor, where now QP_1 and QP_2 differ. The process is repeated up until the full multi-scale model is learned. To save memory during various stages of learning, we only instantiate QP_2 up to the last layer used by the multi-scale predictor (not suitable for QP_k when k > 2). We use a batch size of 40 images, a fixed learning rate of 10^-6, momentum of 0.9 and weight decay of 0.0005. We also decrease learning rates of parameters built on lower scales [29] by a factor of 10. Batch normalization [18] is used before each non-linearity. Both our models and code are available online (footnote 1).

Prior work: We briefly compare our approach to recent work on keypoint prediction that makes use of deep architectures. Many approaches incorporate multi-scale cues by evaluating a deep network over an image pyramid [44, 46, 47]. Our model processes only a single image scale, extracting multi-scale features from multiple layers of a single network, where importantly, fine-scale features are refined through top-down feedback. Other approaches cast the problem as one of regression, where (x,y) keypoint locations are predicted [54] and often iteratively refined [3, 42]. Our models predict heatmaps, which can be thought of as marginal distributions over the (x,y) location of a keypoint, capturing uncertainty. We show that by thresholding the heatmap value (certainty), one can also produce keypoint visibility estimates “for free”. Our comments hold for our bottom-up model QP_1, which can be thought of as an FCN tuned for keypoint heatmap prediction, rather than semantic pixel labeling. Indeed, we find such an approach to be a surprisingly simple but effective baseline that outperforms much prior work.

1 https://github.com/peiyunh/rg-mpii
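
Reading a location and a visibility flag off a heatmap can be sketched as an argmax followed by a threshold on the peak value. The function name and the threshold of 0.5 are hypothetical; in practice the threshold would be chosen to reach a desired precision.

```python
import numpy as np

def predict_keypoint(heatmap, threshold=0.5):
    """Return ((row, col), is_visible) from a single keypoint heatmap."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    confidence = float(heatmap[row, col])
    # The keypoint is declared visible when the peak is confident enough.
    return (int(row), int(col)), confidence >= threshold

confident = np.zeros((4, 4))
confident[1, 2] = 0.9
uncertain = np.full((4, 4), 0.1)
uncertain[3, 0] = 0.2

loc_a, vis_a = predict_keypoint(confident)
loc_b, vis_b = predict_keypoint(uncertain)
```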

4. Experimental Results

We evaluated fine-scale keypoint localization on several benchmark datasets of human faces and bodies. To better illustrate the benefit of top-down feedback, we focus on datasets with significant occlusions, where bottom-up cues will be less reliable. All datasets provide a rough detection window for the face/body of interest. We crop and resize detection windows to 224x224 before feeding into our model. Recall that QP_1 is essentially a re-implementation of an FCN [29] defined on a VGG-16 network [38], and so represents quite a strong baseline. Also recall that QP_2 adds top-down reasoning without any increase in the number of parameters. We will show this consistently improves performance, sometimes considerably. Unless otherwise stated, results are presented for a 4-scale multi-scale model.

AFLW: The AFLW dataset [24] is a large-scale real-world collection of 25,993 faces in 21,997 real-world images, annotated with facial keypoints. Notably, these faces are not limited to be responses from an existing face detector, and so this dataset contains more pose variation than other landmark datasets. We hypothesized that such pose variation might illustrate the benefit of bidirectional reasoning. Due to a lack of standard splits, we randomly split the dataset into training (60%), validation (20%) and test (20%). As this is not a standard benchmark dataset, we compare to ourselves for exploring the best practices to build multi-scale predictors for keypoint localization (Fig. 7). We include qualitative visualizations in Fig. 6.

COFW: Caltech Occluded Faces-in-the-Wild (COFW) [2] is a dataset of 1007 face images with severe occlusions. We present qualitative results in Fig. 8 and Fig. 9 and quantitative results in Table 1 and Fig. 10. Our bottom-up QP_1 already performs near the state-of-the-art, while QP_2 significantly improves the accuracy of visible landmark localization and occlusion prediction. In terms of the latter, our model even approaches upper bounds that make use of ground-truth segmentation labels [7]. Our models are not quite state-of-the-art in localizing occluded points. We believe this may point to a limitation in the underlying benchmark. Consider an image of a face mostly occluded by the hand (Fig. 9). In such cases, humans may not even agree on keypoint locations, indicating that a keypoint distribution may be a more reasonable target output. Our models provide such uncertainty estimates,



Figure 6: Facial landmark localization results of QP_2 on AFLW, where landmark ids are denoted by color. We only plot landmarks annotated visible. Our bidirectional model is able to deal with large variations in illumination, appearance and pose (a). We show images with multiple challenges present in (b).



Figure 7: We plot the fraction of recalled face images whose average pixel localization error in AFLW (normalized by face size [56]) is below a threshold (x-axis). We compare our QP_1 and QP_2 with varying numbers of scales used for multi-scale prediction, following the naming convention of FCN [29] (where the Nx encodes the upsampling factor needed to resize the predicted heatmap to the original image resolution). Single-scale models (QP_1-32x and QP_2-32x) are identical but perform quite poorly, not localizing any keypoints within 3.0% of the face size. Adding more scales dramatically improves performance, and moreover, as we add additional scales, the relative improvement of QP_2 also increases (as finer-scale features benefit the most from feedback). We visualize such models in Fig. 12.

Figure 8: Visualization of keypoint predictions by QP_1 and QP_2 on two example COFW images. Both our models predict both keypoint locations and their visibility (produced by thresholding the value of the heatmap confidence at the predicted location). We denote (in)visible keypoint predictions with (red) green dots, and also plot the raw heatmap prediction as a colored distribution overlaid on a darkened image. Both our models correctly estimate keypoint visibility, but our bottom-up model QP_1 misestimates their locations (because bottom-up evidence is misleading during occlusions). By integrating top-down knowledge (perhaps encoding spatial constraints on configurations of keypoints), QP_2 is able to correctly estimate their locations.



Figure 9: Facial landmark localization and occlusion prediction results of QP_2 on COFW, where red means occluded. Our bidirectional model is robust to occlusions caused by objects, hair, and skin. We also show cases where the model correctly predicts visibility but fails to accurately localize occluded landmarks (b).


while most keypoint architectures based on regression 
cannot. 

Pascal Person: The Pascal 2011 Person dataset consists of 11,599 person instances, each annotated with a



Method           Visible Points   All Points
RCPR [2]         -                8.5
RPP              -                7.52
HPM              -                7.46
SAPM             5.77             6.89
FLD-Full [50]    5.18             5.93
QP_1             5.26             10.06
QP_2             4.67             7.87

Table 1: Average keypoint localization error (as a fraction of inter-ocular distance) on COFW. When adding top-down feedback (QP_2), our accuracy on visible keypoints significantly improves upon prior work. In the text, we argue that such localization results are more meaningful than those for occluded keypoints. In Fig. 10, we show that our models significantly outperform all prior work in terms of keypoint visibility prediction.

Figure 10: Keypoint visibility prediction on COFW, measured by precision-recall. Our bottom-up model QP_1 already outperforms all past work that does not make use of ground-truth segmentation masks (where acronyms correspond to those in Table 1). Our top-down model QP_2 even approaches the accuracy of such upper bounds. Following standard protocol, we evaluate and visualize accuracy in Fig. 9 at a precision of 80%. At such a level, our recall (76%) significantly outperforms the best previously-published recall of FLD [50] (49%).


bounding box around the visible region and up to 23 human keypoints per person. This dataset contains significant occlusions. We follow the evaluation protocol of [30] and present results for localization of visible keypoints on a standard testset in Table 2. Our bottom-up QP_1 model already significantly improves upon the state-of-the-art (including prior work making use of deep features), while our top-down model QP_2 further improves accuracy by 2% without any increase in model complexity (as measured by the number of parameters). Note that the standard evaluation protocols evaluate only visible keypoints. In Fig. 11 we demonstrate that our model can also accurately predict keypoint visibility “for free”.

Figure 11: Keypoint visibility prediction on Pascal Person (a dataset with significant occlusion and truncation), measured by precision-recall curves. At 80% precision, our top-down model (QP_2) significantly improves recall from 65% to 85%.

α                 0.10    0.20
CNN+prior [30]    47.1    -
QP_1              66.5    78.9
QP_2              68.8    80.8

Table 2: We show human keypoint localization performance on PASCAL VOC 2011 Person following the evaluation protocol in [30]. PCK refers to the fraction of keypoints that were localized within some distance (measured with respect to the instance’s bounding box). Our bottom-up models already significantly improve results across all distance thresholds (α = 10, 20%). Our top-down models add a 2% improvement without increasing the number of parameters.
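
For readers unfamiliar with the PCK metric reported in Table 2, the following sketch computes it for a single instance. It assumes the common convention that the tolerance is α times the larger side of the bounding box and that only visible keypoints are scored; the exact normalization used by the evaluation protocol is not stated here, so treat that choice, along with the function name and example coordinates, as hypothetical.

```python
import numpy as np

def pck(pred, gt, visible, bbox_hw, alpha=0.1):
    """Fraction of visible keypoints within alpha * max(height, width) of ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    distances = np.linalg.norm(pred - gt, axis=1)
    tolerance = alpha * max(bbox_hw)
    return float(np.mean(distances[visible] <= tolerance))

# Hypothetical instance: a 100x50 bounding box gives a tolerance of 10 pixels.
pred = [[0, 0], [10, 10], [50, 50]]
gt = [[3, 4], [10, 30], [0, 0]]
visible = [True, True, False]
score = pck(pred, gt, visible, bbox_hw=(100, 50), alpha=0.1)
```

The first keypoint is 5 pixels off (correct), the second is 20 pixels off (incorrect), and the third is ignored because it is not visible.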


MPII: MPII is (to our knowledge) the largest available articulated human pose dataset [1], consisting of 40,000 people instances annotated with keypoints, visibility flags, and activity labels. We present qualitative results in Fig. 14 and quantitative results in Table 3. Our top-down model QP_2 appears to outperform all prior work on full-body keypoints. Note that this dataset also includes visibility labels for keypoints, even though these are not part of the standard evaluation protocol. In Fig. 13 we demonstrate that visibility prediction on MPII also benefits from top-down feedback.


TB: It is worth contrasting our results with TB [45], which implicitly models feedback by (1) using an MRF to post-process CNN outputs to ensure kinematic consistency between keypoints and (2) using high-level predictions from a coarse CNN to adaptively crop high-res features for a fine CNN. Our single CNN endowed with top-down feedback is slightly more accurate without requiring any additional parameters, while being 2x faster (86.5 ms vs TB’s 157.2 ms). These results suggest that top-down reasoning may elegantly capture structured outputs and attention, two active areas of research in deep learning.

Figure 13: Keypoint visibility prediction on MPII, measured by precision-recall curves. At 80% precision, our top-down model (QP_2) improves recall from 44% to 49%.

Figure 14: Keypoint localization results of QP_2 on the MPII Human Pose testset. We quantitatively evaluate results on the validation set in Table 2. Our models are able to localize keypoints even under significant occlusions. Recall that our models can also predict visibility labels “for free”, as shown in Fig. 13.

More recurrence iterations: To explore QP_K’s performance as a function of K without exceeding memory limits, we trained a smaller network from scratch on 56x56

Figure 12: We visualize bottom-up and top-down models trained for human pose estimation, using the naming convention of Fig. 7. Top-down feedback (QP_2) more accurately guides finer-scale predictions, resolving left-right ambiguities in the ankle (red) and poor localization of the knee (green) in the bottom-up model (QP_1).



Method     Head   Shou   Elb    Wri    Hip    Kne    Ank    Upp    Full
GM [10]    -      36.3   26.1   15.3   -      -      -      25.9   -
ST [37]    -      38.0   26.3   19.3   -      -      -      27.9   -
YR [52]    73.2   56.2   41.3   32.1   36.2   33.2   34.5   43.2   44.5
PS [35]    74.2   49.0   40.8   34.1   36.5   34.4   35.1   41.3   44.0
TB [45]    96.1   91.9   83.9   77.8   80.9   72.3   64.8   84.5   82.0
QP_1       94.3   90.4   81.6   75.2   80.1   73.0   68.3   82.4   81.1
QP_2       95.0   91.6   83.0   76.6   81.9   74.5   69.5   83.8   82.4

Table 3: We show PCKh-0.5 keypoint localization results on MPII using the recommended benchmark protocol [1].


K             1      2      3      4      5      6
Upper Body    57.8   59.6   58.7   61.4   58.7   60.9
Full Body     59.8   62.3   61.0   63.1   61.2   62.6

Table 4: PCKh(.5) on MPII-Val for a smaller network.


sized inputs for 100 epochs. As shown in Table 4, we conclude: (1) all recurrent models outperform the bottom-up baseline QP_1; (2) additional iterations generally help, but performance maxes out at QP_4. A two-pass model (QP_2) is surprisingly effective at capturing top-down information while being fast and easy to train.

Conclusion: We show that hierarchical Rectified Gaussian models can be optimized with rectified neural networks. From a modeling perspective, this observation allows one to discriminatively train such probabilistic models with neural toolboxes. From a neural net perspective, this observation provides a theoretically elegant approach for endowing CNNs with top-down feedback - without any increase in the number of parameters. To thoroughly evaluate our models, we focus on “vision-with-scrutiny” tasks such as keypoint localization, making use of well-known benchmark datasets. We introduce (near) state-of-the-art bottom-up baselines based on multi-scale prediction, and consistently improve upon those results with top-down feedback (particularly during occlusions when bottom-up evidence may be ambiguous).